In this post we describe several variations of normalized gradient descent, known generally as steepest descent algorithms. These variations arise out of a small but fundamental change to the gradient descent setup, a change that has significant impact on the form of a descent direction but, importantly, not on the convergence of the methods (steepest descent methods converge under the same general condition as gradient descent, i.e., that the length of the steps eventually diminishes).
Because normalized gradient approaches have the benefit of being able to push through flat regions of a non-convex function (as we saw earlier in this series), many popular optimizers for deep neural networks are built on the basic methods presented here - in particular the $\ell_{\infty}$ steepest descent step, for example, the Resilient Backpropagation algorithm or Rprop, as well as its variations like Root Mean Square Propagation or RMSprop.
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
from mlrefined_libraries import basics_library as baslib
from mlrefined_libraries import calculus_library as callib
from mlrefined_libraries import math_optimization_library as optlib
from mlrefined_libraries import linear_algebra_library as linlib
import autograd.numpy as np
import time
#this is needed to compensate for matplotlib notebook's tendancy to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True
%load_ext autoreload
%autoreload 2
When we first discussed gradient descent the descent direction given by the negative gradient fell out naturally from a geometric understanding of hyperplanes. Here we begin our discussion of steepest descent algorithms by re-visiting gradient descent and deriving the descent direction using a more rigorous mathematical framework. This framework immediately suggests a way to generalize the gradient descent algorithm under different norms, which we then explore.
When first deriving the gradient descent direction of a multi-input function $g(\mathbf{w})$ at a point $\mathbf{v}$ we began by examining the tangent hyperplane to $g$ at this point
\begin{equation} h\left(\mathbf{w}\right)=g\left(\mathbf{v}\right)+\nabla g\left(\mathbf{v}\right)^{T}\left(\mathbf{w}-\mathbf{v}\right) \end{equation}We then reasoned out that since in general we know that the ascent direction on a hyperplane is given by its set of 'slopes' - stored in $\nabla g(\mathbf{v})$ - that intuitively therefore the descent direction is given by the negative gradient as $-\nabla g(\mathbf{v})$, or normalized to unit length as $-\frac{\nabla g(\mathbf{v})}{\left\Vert \nabla g(\mathbf{v}) \right\Vert_2 }$. We often used the latter unit-normalized version since after all we care only about the descent direction.
This descent direction can be derived more formally as follows. Note that $\mathbf{d} = \mathbf{w} - \mathbf{v}$ is a general search direction centered at the point $\mathbf{v}$. We want to find the unit-length direction $\mathbf{d}$ that provides the smallest evaluation on the hyperplane, i.e., the one that gives the smallest value $g(\mathbf{v}) + \nabla g(\mathbf{v})^T(\mathbf{w} - \mathbf{v}) = g(\mathbf{v}) + \nabla g(\mathbf{v})^T\mathbf{d}$. Since $\mathbf{d}$ is only present in the second term we only need care about finding the unit length direction $\mathbf{d}$ so that $\nabla g(\mathbf{v})^T\mathbf{d}$ is as small as possible (after all, $g(\mathbf{v})$ is constant).
Formally this is a simple constrained minimization problem
Using the toy in the next Python cell we explore possible solutions to this problem in two dimensions for a particular choice of gradient vector, gaining valuable geometric intuition as to what the solution should be. There we plot an example gradient vector $\nabla g (\mathbf{v})$ as a red arrow, along with the $\ell_2$ unit ball. Moving the slider from left to right you can test various directions $\mathbf{d}$ (each shown as a black arrow) on this $\ell_2$ unit ball computing the inner product $\nabla g(\mathbf{v})^{T}\mathbf{d}$, whose value is simultaneously plotted in the right panel. As you move the slider right the direction providing the smallest value is shown as a green arrow in the left panel, and is highlighted in green in the plot in the right panel.
# create an animation showing the origin of the sine and cosine functions
num_frames = 100
pt = [1.3,0.75]
optlib.steepest_descent_direction.L2(pt,num_frames)
As can be seen it appears as if indeed $\mathbf{d} = -\frac{\nabla g(\mathbf{v})}{\left\Vert \nabla g(\mathbf{v}) \right\Vert_2 }$ is the direction producing the smallest inner product.
To prove this formally we can use the inner product rule which tells us almost immediately what the solution to this problem is. According to the inner product rule $\nabla g(\mathbf{v})^{T}\mathbf{d}$ can be written as
$$\nabla g(\mathbf{v})^{T}\mathbf{d} = \left \Vert \nabla g(\mathbf{v}) \right \Vert_2 \left \Vert \mathbf{d} \right \Vert_2 \text{cos}(\theta)$$where $\theta$ is the angle between $\nabla g(\mathbf{v})$ and $\mathbf{d}$.
Noting that both $\left \Vert \nabla g(\mathbf{v}) \right \Vert_2$ and $\left \Vert \mathbf{d} \right \Vert_2$ have constant values (the former is the length of the gradient at $\mathbf{v}$, and latter is just $1$), the value of $\nabla g(\mathbf{v})^{T}\mathbf{d}$ is smallest when $\text{cos}(\theta)$ is smallest, i.e., when $\theta = \pi$. In other words, $\mathbf{d}$ must point in the same direction as $-\nabla g(\mathbf{v})$, and also be unit length. Thus we have $\mathbf{d} = -\frac{\nabla g(\mathbf{v})}{\left \Vert \nabla g(\mathbf{v}) \right \Vert_2}$, that is indeed the normalized gradient descent direction. In deriving the gradient descent direction in this way, $\mathbf{d}$ is referred to more generally as the steepest descent direction.
With this setup we can actually choose any norm in the constraint that we wish to determine a descent direction. For example, replacing the $\ell_2$ with the $\ell_1$ norm we have a similar looking problem
whose solution defines a new steepest descent direction with respect to the $\ell_1$ norm. The direction we find here will certainly be different than the $\ell_2$ constrained version.
Using the toy in the next Python cell we explore possible solutions to this problem in two dimensions for a particular choice of gradient vector, gaining valuable geometric intuition as to what the solution should be. There we plot an example gradient vector $\nabla g (\mathbf{v})$ as a red arrow, along with the $\ell_1$ unit ball. Moving the slider from left to right you can test various directions $\mathbf{d}$ (each shown as a black arrow) on this $\ell_1$ unit ball computing the inner product $\nabla g(\mathbf{v})^{T}\mathbf{d}$, whose value is simultaneously plotted in the right panel. As you move the slider right the direction providing the smallest value is shown as a green arrow in the left panel, and is highlighted in green in the plot in the right panel.
# create an animation showing the origin of the sine and cosine functions
num_frames = 100
pt = [1.3,0.75]
optlib.steepest_descent_direction.L1(pt,num_frames)
Let us try a different a gradient vector $\nabla g (\mathbf{v})$, this time in the second quadrant.
# create an animation showing the origin of the sine and cosine functions
num_frames = 100
pt = [-0.3,1.2]
optlib.steepest_descent_direction.L1(pt,num_frames)